Search results for "High-dimensional data"

Showing 10 of 24 documents

Penalized regression and clustering in high-dimensional data

The main goal of this Thesis is to describe numerous statistical techniques that deal with high-dimensional genomic data. The Thesis begins with a review of the literature on penalized regression models, with particular attention to least absolute shrinkage and selection operator (LASSO) or L1-penalty methods. L1 logistic/multinomial regression models are used for variable selection and discriminant analysis with a binary/categorical response variable. The Thesis discusses and compares several methods that are commonly utilized in genetics, and introduces new strategies to select markers according to their informative content and to discriminate clusters by offering reduced panels for popul…
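The L1-penalized estimation the Thesis reviews can be sketched with a plain coordinate-descent lasso solver. This is a generic NumPy illustration on synthetic data, not the Thesis's own code; the point is how the soft-thresholding step sets most coefficients exactly to zero, which is what makes the lasso a variable-selection tool:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the core of L1 (lasso) shrinkage."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate-descent lasso minimizing (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]                 # partial residual without j
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
    return b

# Synthetic high-dimensional-style data: only 3 of 50 markers are informative.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
beta_true = np.zeros(50)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(200)
b_hat = lasso_cd(X, y, lam=0.1)
print(np.count_nonzero(b_hat))  # most coefficients are shrunk exactly to zero
```

The same soft-thresholding mechanism carries over to the L1 logistic/multinomial models mentioned above, with the least-squares fit replaced by the corresponding likelihood.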

Keywords: Lasso regression; High-dimensional data; Genomic data; Tuning parameter selection; Quantile regression coefficients modeling; Curves clustering; Settore SECS-S/01 - Statistica

An Extension of the DgLARS Method to High-Dimensional Relative Risk Regression Models

2020

In recent years, clinical studies in which patients are routinely screened for many genomic features have become more common. The general aim of such studies is to find genomic signatures useful for treatment decisions and the development of new treatments. However, genomic data are typically noisy and high-dimensional, with the number of features often outstripping the number of patients included in the study. For this reason, sparse estimators are usually used in the study of high-dimensional survival data. In this paper, we propose an extension of the differential geometric least angle regression (dgLARS) method to high-dimensional relative risk regression models.
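The relative risk (Cox) model underlying this line of work is fitted through the partial likelihood. A minimal NumPy sketch of that quantity is below; it assumes no tied event times and is a generic illustration, not the dgLARS estimator itself:

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, time, event):
    """Negative log partial likelihood of the Cox relative risk model.

    time  : observed times; event : 1 if the event was observed, 0 if censored.
    Assumes no ties among event times, for simplicity.
    """
    eta = X @ beta                      # linear predictors x_i' beta
    order = np.argsort(-time)           # descending time: risk sets grow as we scan
    eta_sorted = eta[order]
    # running log of the risk-set sum: log sum_{t_j >= t_i} exp(eta_j)
    log_risk = np.logaddexp.accumulate(eta_sorted)
    ll = 0.0
    for k in np.flatnonzero(event[order] == 1):
        ll += eta_sorted[k] - log_risk[k]
    return -ll

# Toy check: with beta = 0 every subject has equal risk, so the value
# is sum(log |risk set|) over the events: log 1 + log 2 + log 3 = log 6.
X = np.array([[1.0], [0.0], [0.5]])
val = neg_log_partial_likelihood(np.array([0.0]),
                                 X,
                                 np.array([3.0, 2.0, 1.0]),
                                 np.array([1, 1, 1]))
print(val)  # ≈ 1.792 (= log 6)
```

Sparse methods such as dgLARS add a selection mechanism on top of this likelihood so that only a few genomic features receive non-zero coefficients.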

Keywords: dgLARS; Gene expression data; High-dimensional data; Relative risk regression models; Least-angle regression; Sparsity; Survival analysis; Settore SECS-S/01 - Statistica

Regularized Regression Incorporating Network Information: Simultaneous Estimation of Covariate Coefficients and Connection Signs

2014

We develop an algorithm that incorporates network information into regression settings. It simultaneously estimates the covariate coefficients and the signs of the network connections (i.e. whether the connections are of an activating or of a repressing type). For the coefficient estimation steps an additional penalty is set on top of the lasso penalty, similarly to Li and Li (2008). We develop a fast implementation for the new method based on coordinate descent. Furthermore, we show how the new methods can be applied to time-to-event data. The new method yields good results in simulation studies concerning sensitivity and specificity of non-zero covariate coefficients, estimation of networ…
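A penalty of the kind described, with a network term stacked on top of the lasso penalty (in the spirit of Li and Li, 2008), can be written down directly. The form below, with signed adjacency entries encoding activating/repressing connections, is an illustrative assumption rather than the paper's exact estimator:

```python
import numpy as np

def network_penalty(b, A, deg):
    """Signed network smoothness penalty (sketch, in the spirit of Li & Li 2008).

    A[u, v] = +1 for an activating edge, -1 for a repressing edge, 0 otherwise.
    Degree-scaled coefficients of connected covariates are pulled together
    (activating) or toward opposite signs (repressing).
    """
    p = len(b)
    pen = 0.0
    for u in range(p):
        for v in range(u + 1, p):
            if A[u, v] != 0:
                pen += (b[u] / np.sqrt(deg[u])
                        - A[u, v] * b[v] / np.sqrt(deg[v])) ** 2
    return pen

def objective(b, X, y, A, deg, lam1, lam2):
    """Least squares + lasso penalty + network penalty."""
    resid = y - X @ b
    return 0.5 * resid @ resid + lam1 * np.abs(b).sum() + lam2 * network_penalty(b, A, deg)

b = np.array([1.0, 1.0])
deg = np.array([1, 1])
activating = np.array([[0, 1], [1, 0]])
repressing = np.array([[0, -1], [-1, 0]])
print(network_penalty(b, activating, deg))  # 0.0: connected coefficients agree
print(network_penalty(b, repressing, deg))  # 4.0: a repressing edge wants opposite signs
```

Minimizing such an objective by coordinate descent, as the paper does, alternates soft-thresholding steps (from the lasso term) with shrinkage toward the network-smoothed values.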

Keywords: high-dimensional data; gene expression data; pathway information; penalized regression; Lasso; coordinate descent; JEL: C13, C41

A local complexity based combination method for decision forests trained with high-dimensional data

2012

Accurate machine learning with high-dimensional data is affected by phenomena known as the “curse” of dimensionality. One of the main strategies explored in the last decade to deal with this problem is the use of multi-classifier systems. Several of such approaches are inspired by the Random Subspace Method for the construction of decision forests. Furthermore, other studies rely on estimations of the individual classifiers' competence, to enhance the combination in the multi-classifier and improve the accuracy. We propose a competence estimate which is based on local complexity measurements, to perform a weighted average combination of the decision forest. Experimental results show how thi…
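The competence-weighted combination step can be sketched as follows. Here `local_accuracy` is a deliberately simple stand-in competence estimate (accuracy on the query point's nearest validation neighbours); the paper instead builds its estimate from local complexity measurements:

```python
import numpy as np

def weighted_vote(member_probs, competence):
    """Competence-weighted average of the members' class-probability outputs.

    member_probs : (n_members, n_classes) probabilities for one query point
    competence   : (n_members,) non-negative local competence estimates
    """
    w = np.asarray(competence, dtype=float)
    w = w / w.sum()
    return w @ member_probs

def local_accuracy(member_val_preds, y_val, neighbour_idx):
    """Illustrative competence proxy: each member's accuracy on the
    validation points nearest to the query (NOT the paper's complexity
    measure, just a common simple alternative)."""
    local = member_val_preds[:, neighbour_idx]
    return (local == y_val[neighbour_idx]).mean(axis=1)

# A competent member (weight 3) dominates a weaker one (weight 1):
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
print(weighted_vote(probs, np.array([3.0, 1.0])))  # [0.725, 0.275]
```

A plain (unweighted) decision forest corresponds to `competence` being constant; the point of the paper is that locally adaptive weights improve on that.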

Keywords: decision trees; random forest; random subspace method; classifier competence; computational complexity; curse of dimensionality
Published in: 2012 12th International Conference on Intelligent Systems Design and Applications (ISDA)

Distance Functions, Clustering Algorithms and Microarray Data Analysis

2010

Distance functions are a fundamental ingredient of classification and clustering procedures, and this holds true also in the particular case of microarray data. In the general data mining and classification literature, functions such as Euclidean distance or Pearson correlation have gained their status of de facto standards thanks to a considerable amount of experimental validation. For microarray data, the issue of which distance function works best has been investigated, but no final conclusion has been reached. The aim of this extended abstract is to shed further light on that issue. Indeed, we present an experimental study, involving several distances, assessing (a) their intrinsic sepa…
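The two de facto standard distances mentioned above behave very differently on expression profiles that share a shape but not a scale; a small NumPy comparison:

```python
import numpy as np

def euclidean(x, y):
    """Plain Euclidean distance: sensitive to absolute expression levels."""
    return np.sqrt(((x - y) ** 2).sum())

def pearson_distance(x, y):
    """1 - Pearson correlation: small when profiles rise and fall together,
    regardless of their absolute levels."""
    return 1.0 - np.corrcoef(x, y)[0, 1]

a = np.array([1.0, 2.0, 3.0])
b = a * 10                     # same shape, ten times the scale
print(euclidean(a, b))         # large: the levels differ
print(pearson_distance(a, b))  # ~0: the profiles are perfectly correlated
```

This scale (in)sensitivity is exactly why the choice of distance can change which genes end up clustered together, and why the question the abstract raises has no obvious winner.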

Keywords: clustering distance measures; Euclidean distance; Pearson product-moment correlation coefficient; cluster analysis; fuzzy clustering; Settore INF/01 - Informatica

SparseHC: A Memory-efficient Online Hierarchical Clustering Algorithm

2014

Computing a hierarchical clustering of objects from a pairwise distance matrix is an important algorithmic kernel in computational science. Since the storage of this matrix requires quadratic space with respect to the number of objects, the design of memory-efficient approaches is of high importance to this research area. In this paper, we address this problem by presenting a memory-efficient online hierarchical clustering algorithm called SparseHC. SparseHC scans a sorted and possibly sparse distance matrix chunk-by-chunk. Meanwhile, a dendrogram is built by merging cluster pairs as and when the distance between them is determined to be the smallest among all remaining cluster pairs. The k…
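For single linkage, merging cluster pairs while scanning a distance-sorted edge stream reduces to Kruskal-style union-find, which never materializes the quadratic matrix. The sketch below illustrates that special case only; SparseHC itself also handles complete and average linkage via its chunked scan:

```python
class DisjointSet:
    """Union-find over n objects with path halving."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return False
        self.parent[rj] = ri
        return True

def single_linkage_from_stream(n, sorted_edges):
    """Single-linkage merges from a stream of (distance, i, j) edges sorted
    by distance -- the full n x n matrix is never held in memory.
    Returns the dendrogram as a list of (distance, i, j) merge events.
    """
    ds = DisjointSet(n)
    merges = []
    for d, i, j in sorted_edges:
        if ds.union(i, j):           # smallest remaining inter-cluster distance
            merges.append((d, i, j))
        if len(merges) == n - 1:     # dendrogram complete: stop early
            break
    return merges

edges = [(1.0, 0, 1), (1.5, 2, 3), (2.0, 1, 2), (5.0, 0, 3)]
print(single_linkage_from_stream(4, edges))
```

Because edges arrive sorted, each merge is final as soon as it happens, which is what makes the one-pass, chunk-by-chunk processing possible.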

Keywords: sparse matrix; online algorithms; memory-efficient clustering; hierarchical clustering; single-linkage clustering; complete-linkage clustering; dendrogram; distance matrix
Published in: Procedia Computer Science

Sparse relative risk regression models

2020

Clinical studies where patients are routinely screened for many genomic features are becoming more common. In principle, this holds the promise of being able to find genomic signatures for a particular disease. In particular, cancer survival is thought to be closely linked to the genomic constitution of the tumor. Discovering such signatures will be useful in the diagnosis of the patient, may guide treatment decisions and, perhaps, even the development of new treatments. However, genomic data are typically noisy and high-dimensional, often outstripping the number of patients included in the study. Regularized survival models have been proposed to deal with such scenarios…

Keywords: dgLARS; relative risk regression models; least-angle regression; regularization; sparsity; high-dimensional data; gene expression data; survival analysis; biostatistics; Settore SECS-S/01 - Statistica

Structural clustering of millions of molecular graphs

2014

We propose an algorithm for clustering very large molecular graph databases according to scaffolds (i.e., large structural overlaps) that are common between cluster members. Our approach first partitions the original dataset into several smaller datasets using a greedy clustering approach named APreClus based on dynamic seed clustering. APreClus is an online and instance incremental clustering algorithm delaying the final cluster assignment of an instance until one of the so-called pending clusters the instance belongs to has reached significant size and is converted to a fixed cluster. Once a cluster is fixed, APreClus recalculates the cluster centers, which are used as representatives for…
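The greedy seed-based partitioning step can be illustrated with a one-pass, leader-style clustering over set-valued fingerprints. This is a simplified stand-in for APreClus (no pending clusters and no center recomputation: the first member of each cluster simply acts as its seed), and Jaccard similarity with a fixed threshold is an assumed choice:

```python
def jaccard(a, b):
    """Jaccard similarity of two fingerprint / scaffold sets."""
    return len(a & b) / len(a | b)

def greedy_seed_clustering(items, threshold):
    """One-pass greedy clustering: join the most similar existing seed
    if its similarity clears the threshold, otherwise open a new cluster.
    """
    seeds, clusters = [], []
    for item in items:
        best, best_sim = None, threshold
        for k, seed in enumerate(seeds):
            sim = jaccard(item, seed)
            if sim >= best_sim:
                best, best_sim = k, sim
        if best is None:
            seeds.append(item)          # this item becomes a new seed
            clusters.append([item])
        else:
            clusters[best].append(item)
    return clusters

mols = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
print(greedy_seed_clustering(mols, threshold=0.4))
```

The instance-incremental design is what lets such a scheme scale to millions of graphs: each molecule is touched once, against a small set of seeds, rather than being compared against every other molecule.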

Keywords: cluster analysis; molecular graphs; hierarchical clustering; data stream clustering; k-medoids; affinity propagation
Published in: Proceedings of the 29th Annual ACM Symposium on Applied Computing

A fast and recursive algorithm for clustering large datasets with k-medians

2012

Clustering large samples of high-dimensional data with fast algorithms is an important challenge in computational statistics. Borrowing ideas from MacQueen (1967), who introduced a sequential version of the $k$-means algorithm, a new class of recursive stochastic gradient algorithms designed for the $k$-medians loss criterion is proposed. By their recursive nature, these algorithms are very fast and well adapted to large samples of data that are allowed to arrive sequentially. It is proved that the stochastic gradient algorithm converges almost surely to the set of stationary points of the underlying loss criterion. Particular attention is paid to the averaged versions, which…
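The recursive idea is simple to state: each arriving point moves its nearest center a small, decaying step along the unit vector pointing at the point (the gradient of the non-squared distance loss). A NumPy sketch of this MacQueen-style scheme follows, with the step-size schedule chosen for illustration only, not taken from the paper:

```python
import numpy as np

def online_k_medians(stream, k, gamma0=0.5):
    """Recursive stochastic-gradient k-medians (illustrative sketch).

    The loss per point is ||x - c_nearest|| (distance, not squared), whose
    gradient in c is the unit vector -(x - c)/||x - c||, so each update
    moves the closest center a decaying step toward the new point.
    """
    stream = iter(stream)
    centers = np.array([next(stream) for _ in range(k)], dtype=float)  # seed with first k points
    counts = np.ones(k)
    for x in stream:
        j = np.argmin(np.linalg.norm(centers - x, axis=1))   # closest center
        diff = x - centers[j]
        norm = np.linalg.norm(diff)
        if norm > 0:
            counts[j] += 1
            step = gamma0 / counts[j] ** 0.75                # slowly decaying steps
            centers[j] += step * diff / norm
    return centers

# Two well-separated synthetic clusters, points arriving one at a time:
rng = np.random.default_rng(1)
pts = []
for t in range(500):
    center = (0.0, 0.0) if t % 2 == 0 else (10.0, 10.0)
    pts.append(np.asarray(center) + 0.5 * rng.standard_normal(2))
centers = online_k_medians(pts, k=2)
print(centers)  # one center near each cluster
```

Since each point is processed once and then discarded, memory use is constant in the sample size, which is the property the abstract emphasizes for large, sequentially arriving data.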

Keywords: high-dimensional data; stochastic approximation; recursive estimators; averaging; k-medoids; partitioning around medoids; Robbins-Monro; stochastic gradient; online clustering; computational statistics

A novel heuristic memetic clustering algorithm

2013

In this paper we introduce a novel clustering algorithm based on the Memetic Algorithm meta-heuristic wherein clusters are iteratively evolved using a novel single operator employing a combination of heuristics. Several heuristics are described and employed for the three types of selections used in the operator. The algorithm was exhaustively tested on three benchmark problems and compared to a classical clustering algorithm (k-Medoids) using the same performance metrics. The results show that our clustering algorithm consistently provides better clustering solutions with less computational effort.
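The k-Medoids baseline used in the comparison can be sketched with the simple Voronoi-iteration variant (alternating assignment and medoid update), which is cheaper than full PAM; the memetic algorithm above is then judged against solutions of this kind:

```python
import numpy as np

def k_medoids(D, k, n_iter=20, seed=0):
    """Plain k-Medoids by alternating assignment / medoid update.

    D : (n, n) symmetric pairwise distance matrix.
    Returns the medoid indices and a cluster label per object.
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)   # random initial medoids
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)    # assign to closest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # new medoid = member minimizing total distance to its cluster
                within = D[np.ix_(members, members)].sum(axis=0)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):     # converged
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Two clearly separated groups on a line:
x = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
D = np.abs(x[:, None] - x[None, :])
medoids, labels = k_medoids(D, k=2)
print(medoids, labels)
```

Like k-means, this iteration only finds a local optimum, which is precisely the weakness that population-based meta-heuristics such as the memetic algorithm aim to escape.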

Keywords: cluster analysis; memetic algorithms; k-medoids; determining the number of clusters in a data set; biclustering; DBSCAN
Published in: 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP)